August 2022

Working with data

  • In your career, you will have to deal with data.

  • It can take the form of sparse information (clients, patients, surveys, etc),

  • Or large pieces of information, with or without a time component (stock price, consumer behavior, social media clicks, etc).

  • As a finance person, your goals will embed the following to make more informed decisions (hopefully):

    1. Find data.
    2. Correct data.
    3. Input data.
    4. Describe data.
    5. Exploit data.

\(\rightarrow\) We treat all these points in this lecture.

Why not working with Excel?

Why not working with Excel?

What are the reasons for not using Excel?

  1. It has limited flexibility for tuned operations (histograms for instance).
  2. It can be fine-tuned with VBA, which is a different and somewhat inefficient language.
  3. Applying a complex operation on many rows of data proved complicated without VBA.
  4. It uses a lot of RAM with stupid options like auto-computation.
  5. It is more efficient to use the same software for data cleaning and analysis.
  6. Formulas in Excel are ugly (try putting IF in IF and re-read your formula on the top of your spreadsheet. I guarantee errors.)
  7. It has limited data content: 1,048,576 rows (\(2^{20}\)) and 16,384 columns (\(2^{14}\)) \(\Longrightarrow\) No more than 17,179,869,184 entries (this may seem a lot but can prove nonetheless limited).
  8. It has limited digit precision.
  9. Basic plotting options are ugly.
  10. You can’t track how people transformed the data!

Why starting with Excel still makes sense

  • The main advantage of Excel is that it’s WYSIWYG [what you see is what you get].
  • This is the easiest representation of data to deal with since you can directly read an entry.
  • You can directly visualize the structure of the data.
  • This will help you design your code better.

Panel data structure example:

  • A panel structure is when you observe \(m\) characteristics for \(n\) individuals over \(T\) periods.
  • You can represent it mostly two different ways:
    • [ time \(\times\) ( characteristics & individuals)] \(\Longleftrightarrow\) matrix of size \((T\times nm)\)
    • [( individuals & time) \(\times\) characteristics] \(\Longleftrightarrow\) matrix of size \((nT \times (m+1))\) (time is a variable)

Panel data structure: example

  • Let us look at the S&P500 stock data during 2020.
import yfinance as yf

assets = ['TSLA', 'MSFT', 'META']
prices =  yf.download(assets,
                      start='2020-01-01',
                      end='2020-12-31',
                      progress=False)
#prices.head(2)
list(prices.columns.values)
## [('Adj Close', 'META'), ('Adj Close', 'MSFT'), ('Adj Close', 'TSLA'), ('Close', 'META'), ('Close', 'MSFT'), ('Close', 'TSLA'), ('High', 'META'), ('High', 'MSFT'), ('High', 'TSLA'), ('Low', 'META'), ('Low', 'MSFT'), ('Low', 'TSLA'), ('Open', 'META'), ('Open', 'MSFT'), ('Open', 'TSLA'), ('Volume', 'META'), ('Volume', 'MSFT'), ('Volume', 'TSLA')]
prices.columns.droplevel(1).drop_duplicates()
## Index(['Adj Close', 'Close', 'High', 'Low', 'Open', 'Volume'], dtype='object')

Panel data structure: example

  • Let us look at the S&P500 stock data during 2020.
prices.head(5)
##              Adj Close                         ...    Volume                     
##                   META        MSFT       TSLA  ...      META      MSFT       TSLA
## Date                                           ...                               
## 2020-01-02  209.779999  156.592010  86.052002  ...  12077100  22622100   47660500
## 2020-01-03  208.669998  154.642166  88.601997  ...  11188400  21116200   88892500
## 2020-01-06  212.600006  155.041870  90.307999  ...  17058900  20813700   50665000
## 2020-01-07  213.059998  153.628250  93.811996  ...  14912400  21634100   89410500
## 2020-01-08  215.220001  156.075302  98.428001  ...  13475000  27746500  155721500
## 
## [5 rows x 18 columns]
prices.shape
## (252, 18)
  • The choice here is [ time \(\times\) ( characteristics & individuals)]

What we do in this lecture

In this lecture we:

  1. Learn how to load data from Excel to Python.
  2. Learn data structures and how to shape data with respect to our goal.
  3. Learn how to deal with missing data.
  4. Learn how to describe data with simple statistics.
  5. Learn how to identify outliers.
  6. Learn how to create new variables.

Dealing with data frames

Preliminary: pandas package

  • Python is originally equipped to compute.

  • It is not as efficient as Excel to create data.

  • However, it is more efficient to perform computations.

  • We need to be able to import data from elsewhere to Python.

  • pandas is the package that we need!

  • Documentation available here

import pandas as pd

pandas objects

  • Series: individual series of observations.
my_series = pd.Series([1, 3, 5, np.nan, 6, 8])
print(my_series)
## 0    1.0
## 1    3.0
## 2    5.0
## 3    NaN
## 4    6.0
## 5    8.0
## dtype: float64
  • date_range: essentially a vector of dates
my_dates = pd.date_range("20130101", periods=6)
print(my_dates)
## DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
##                '2013-01-05', '2013-01-06'],
##               dtype='datetime64[ns]', freq='D')

pandas objects

  • DataFrame: the most important object. Combines dates and series!
df = pd.DataFrame(np.random.randn(6, 4), index=my_dates, columns=list("ABCD"))
df
##                    A         B         C         D
## 2013-01-01  1.153390  0.510331  0.918271  1.356835
## 2013-01-02 -0.060675  0.039619  0.626923  0.393700
## 2013-01-03  0.339287 -0.571882  2.102990 -2.241397
## 2013-01-04  1.635909 -0.567590 -1.130645  0.475360
## 2013-01-05  0.559706  0.802252 -0.382540 -0.329580
## 2013-01-06 -1.037048 -1.185737 -1.018382 -0.082644
  • You can directly create DataFrame objects with pd.DataFrame({}) and pass it a dictionary.
  • This is unlikely to be super useful…

DataFrames: further knowledge

  • DataFrames can contain several variable types, as dictionaries.
  • This is likely to happen in practice:
    • age is numerical.
    • gender is categorical.
    • fund name is a bunch of strings.
  • You can explore variable types with dtypes
df.dtypes
## A    float64
## B    float64
## C    float64
## D    float64
## dtype: object

DataFrames: further knowledge

  • Accessing DataFrames elements can be done in several ways.
  • Typically, you want to observe a subset of variables, or a subset of dates, or both.
# Seeing the top 3 lines of DataFrame
df.head(3)
##                    A         B         C         D
## 2013-01-01  1.153390  0.510331  0.918271  1.356835
## 2013-01-02 -0.060675  0.039619  0.626923  0.393700
## 2013-01-03  0.339287 -0.571882  2.102990 -2.241397
# Seeing the bottom 3 lines of DataFrame
df.tail(3)
##                    A         B         C         D
## 2013-01-04  1.635909 -0.567590 -1.130645  0.475360
## 2013-01-05  0.559706  0.802252 -0.382540 -0.329580
## 2013-01-06 -1.037048 -1.185737 -1.018382 -0.082644

DataFrames: further knowledge

  • Selecting a particular subset
df["A"]
## 2013-01-01    1.153390
## 2013-01-02   -0.060675
## 2013-01-03    0.339287
## 2013-01-04    1.635909
## 2013-01-05    0.559706
## 2013-01-06   -1.037048
## Freq: D, Name: A, dtype: float64
df[0:2]
##                    A         B         C         D
## 2013-01-01  1.153390  0.510331  0.918271  1.356835
## 2013-01-02 -0.060675  0.039619  0.626923  0.393700

DataFrames: further knowledge

  • Selecting rows and columns based on indices
df.loc["20130102":"20130103", ["A","C"]]
##                    A         C
## 2013-01-02 -0.060675  0.626923
## 2013-01-03  0.339287  2.102990
  • Selecting all rows but a subset of columns
df.loc[:, ["A","C"]]
##                    A         C
## 2013-01-01  1.153390  0.918271
## 2013-01-02 -0.060675  0.626923
## 2013-01-03  0.339287  2.102990
## 2013-01-04  1.635909 -1.130645
## 2013-01-05  0.559706 -0.382540
## 2013-01-06 -1.037048 -1.018382

DataFrames: further knowledge

  • Selecting rows and columns based on positions
df.iloc[1:3, [0,2]]
##                    A         C
## 2013-01-02 -0.060675  0.626923
## 2013-01-03  0.339287  2.102990
  • Selecting based on conditions
df[df["A"] > 0]
##                    A         B         C         D
## 2013-01-01  1.153390  0.510331  0.918271  1.356835
## 2013-01-03  0.339287 -0.571882  2.102990 -2.241397
## 2013-01-04  1.635909 -0.567590 -1.130645  0.475360
## 2013-01-05  0.559706  0.802252 -0.382540 -0.329580

Training with Data Frames:

Let’s do an exercise:

  1. Create a DataFrame panda object containing all dates from 2020-01-01 to 2020-12-31, two columns named TSX and SPX, and random content using np.random.randn, and multiply them by \(0.1\). We take these as mimicking returns data.
  2. Select and print only the TSX data.
  3. Select and print the days at which the TSX has positive returns. Do the same with SPX. How many dates are the same?
  4. Select and print the days at which the TSX has returns below \(-15\%\). Do the same with SPX. How many dates are the same?
  5. Assume you hold a 50/50 portfolio of both indices. Using a for loop, compute the time series of your portfolio value if you start the year with $100 and keep you portfolio fixed.
  6. Make that time series the third variable in your DataFrame.

Loading data from external sources

import/export data

Loading data from Excel

  • As we mentioned, you will end up going back and forth between Excel and Python.
  • How to load data from Excel?
  • I downloaded the popular indicators from the WorldBank (found here)
  • I named it GDP.csv (in csv format) and GDP.xlsx (in Excel format).
# Loading data from .csv files
Indicators = pd.read_csv("GDP.csv")
Indicators
##                                  Series Name  ... 2015 [YR2015]
## 0                          Population, total  ...      34413603
## 1                          Population, total  ...       2880703
## 2                          Population, total  ...      39728020
## 3                          Population, total  ...         55806
## 4                          Population, total  ...         77993
## ...                                      ...  ...           ...
## 11062  Inflation, consumer prices (annual %)  ...            ..
## 11063  Inflation, consumer prices (annual %)  ...   1.431611458
## 11064  Inflation, consumer prices (annual %)  ...            ..
## 11065  Inflation, consumer prices (annual %)  ...   10.11059289
## 11066  Inflation, consumer prices (annual %)  ...  -2.430968424
## 
## [11067 rows x 20 columns]

Loading data from Excel

  • for the .xslx file:
# Loading data from .xlsx files
# You need to install library "openpyxl" beforehand...
Indicators = pd.read_excel("GDP.xlsx", "Data", index_col=None, 
  na_values=["NA"], engine = 'openpyxl')
Indicators
##                                  Series Name  ... 2015 [YR2015]
## 0                          Population, total  ...      34413603
## 1                          Population, total  ...       2880703
## 2                          Population, total  ...      39728020
## 3                          Population, total  ...         55806
## 4                          Population, total  ...         77993
## ...                                      ...  ...           ...
## 11062  Inflation, consumer prices (annual %)  ...            ..
## 11063  Inflation, consumer prices (annual %)  ...       1.43161
## 11064  Inflation, consumer prices (annual %)  ...            ..
## 11065  Inflation, consumer prices (annual %)  ...       10.1106
## 11066  Inflation, consumer prices (annual %)  ...      -2.43097
## 
## [11067 rows x 20 columns]

Writing data to Excel

  • If you perform significant transformations that take time, you might want to save your data.
  • You can do so with Excel.
  • Saving data in .csv
# Saving data in .csv files
Indicators.to_csv("GDP_2.csv")

# Saving data in .xlsx files
Indicators.to_excel("GDP_2.xlsx", "Data")
  • You can of course indicate a more complete path, otherwise it saves in your current environment.
  • Last, you can also load data directly from the internet when they allow for an API.
  • This is what we did with Yahoo finance data.
  • These are usually dealt with on a case by case basis with a particular library.

Missing data

World indicators data

  • Let us see what is inside our Worldbank data.
Indicators.columns
## Index(['Series Name', 'Series Code', 'Country Name', 'Country Code',
##        '2000 [YR2000]', '2001 [YR2001]', '2002 [YR2002]', '2003 [YR2003]',
##        '2004 [YR2004]', '2005 [YR2005]', '2006 [YR2006]', '2007 [YR2007]',
##        '2008 [YR2008]', '2009 [YR2009]', '2010 [YR2010]', '2011 [YR2011]',
##        '2012 [YR2012]', '2013 [YR2013]', '2014 [YR2014]', '2015 [YR2015]'],
##       dtype='object')
  • We directly detect that this is a panel data, where:
    • time is in the columns.
    • individuals are countries and are in rows.
    • characteristics are the different types of series.
  • Time is annual and runs from 2000 to 2015.

World indicators data

  • Which kind of countries do we have?
Indicators["Country Name"].drop_duplicates("first")
## 0                Afghanistan
## 1                    Albania
## 2                    Algeria
## 3             American Samoa
## 4                    Andorra
##                ...          
## 212    Virgin Islands (U.S.)
## 213       West Bank and Gaza
## 214              Yemen, Rep.
## 215                   Zambia
## 216                 Zimbabwe
## Name: Country Name, Length: 217, dtype: object

World indicators data

  • Which kind of variables do we have?
Indicators["Series Name"].drop_duplicates("first")
## 0                                        Population, total
## 217                           Population growth (annual %)
## 434                                  Surface area (sq. km)
## 651      Poverty headcount ratio at national poverty li...
## 868                        GNI, Atlas method (current US$)
## 1085            GNI per capita, Atlas method (current US$)
## 1302                    GNI, PPP (current international $)
## 1519         GNI per capita, PPP (current international $)
## 1736                       Income share held by lowest 20%
## 1953               Life expectancy at birth, total (years)
## 2170              Fertility rate, total (births per woman)
## 2387     Adolescent fertility rate (births per 1,000 wo...
## 2604     Contraceptive prevalence, any methods (% of wo...
## 2821     Births attended by skilled health staff (% of ...
## 3038       Mortality rate, under-5 (per 1,000 live births)
## 3255     Prevalence of underweight, weight for age (% o...
## 3472     Immunization, measles (% of children ages 12-2...
## 3689     Primary completion rate, total (% of relevant ...
## 3906                School enrollment, secondary (% gross)
## 4123     School enrollment, primary and secondary (gros...
## 4340     Prevalence of HIV, total (% of population ages...
## 4557                                  Forest area (sq. km)
## 4774     Water productivity, total (constant 2010 US$ G...
## 4991          Energy use (kg of oil equivalent per capita)
## 5208                CO2 emissions (metric tons per capita)
## 5425           Electric power consumption (kWh per capita)
## 5642                                     GDP (current US$)
## 5859                                 GDP growth (annual %)
## 6076                    Inflation, GDP deflator (annual %)
## 6293     Agriculture, forestry, and fishing, value adde...
## 6510     Industry (including construction), value added...
## 6727              Exports of goods and services (% of GDP)
## 6944              Imports of goods and services (% of GDP)
## 7161                    Gross capital formation (% of GDP)
## 7378                  Revenue, excluding grants (% of GDP)
## 7595     Start-up procedures to register a business (nu...
## 7812     Market capitalization of listed domestic compa...
## 8029                       Military expenditure (% of GDP)
## 8246        Mobile cellular subscriptions (per 100 people)
## 8463     High-technology exports (% of manufactured exp...
## 8680                          Merchandise trade (% of GDP)
## 8897          Net barter terms of trade index (2000 = 100)
## 9114        External debt stocks, total (DOD, current US$)
## 9331                         Total debt service (% of GNI)
## 9548                                         Net migration
## 9765              Personal remittances, paid (current US$)
## 9982     Foreign direct investment, net inflows (BoP, c...
## 10199            Net ODA received per capita (current US$)
## 10416                         GDP per capita (current US$)
## 10633    Foreign direct investment, net (BoP, current US$)
## 10850                Inflation, consumer prices (annual %)
## Name: Series Name, dtype: object

World indicators data

  • How many variables do we have?
Indicators["Series Name"].drop_duplicates("first").shape
## (51,)
  • We have 51 indicators for 217 countries, or 11,067 observations.
  • Not all variables are equally informed though!
ZWE_df = Indicators.loc[Indicators["Country Name"]=="Zimbabwe", :]
ZWE_df_school = ZWE_df[ZWE_df["Series Name"]=="School enrollment, secondary (% gross)"]
ZWE_df_school
##                                  Series Name  ... 2015 [YR2015]
## 4122  School enrollment, secondary (% gross)  ...            ..
## 
## [1 rows x 20 columns]
  • data seems to be missing!

World indicators data

  • Let us look at that variable
df_school = Indicators[Indicators["Series Name"]=="School enrollment, secondary (% gross)"]
df_school.loc[:, ["Country Name", "2015 [YR2015]"]].head(10)
##              Country Name 2015 [YR2015]
## 3906          Afghanistan       51.6699
## 3907              Albania       97.3885
## 3908              Algeria            ..
## 3909       American Samoa            ..
## 3910              Andorra            ..
## 3911               Angola            ..
## 3912  Antigua and Barbuda        108.23
## 3913            Argentina       106.946
## 3914              Armenia       86.0059
## 3915                Aruba            ..
  • data seems to be missing!

The problem of missing data

  • This is a standard issue with data work: data is unevenly balanced.
  • Why is this a problem?
    • You want to compute the average schooling across 217 countries.
    • You can do it for each year.
    • What if some important countries are missing? Incomplete picture.
    • Some may even be appearing and disappearing.
    • Problems may even get worse for sums instead of means.
  • Here the NA are coded with .. which is totally non-standard.
  • We replace them by standard NA.
Indicators[Indicators == ".."] = np.nan
# Alternative: Indicators.replace("..", np.nan)
df_school = Indicators[Indicators["Series Name"]=="School enrollment, secondary (% gross)"]
df_school.loc[:, ["Country Name", "2015 [YR2015]"]].head(5)
##         Country Name 2015 [YR2015]
## 3906     Afghanistan       51.6699
## 3907         Albania       97.3885
## 3908         Algeria           NaN
## 3909  American Samoa           NaN
## 3910         Andorra           NaN

The problem of missing data

The problem of missing data

  • What does NA do in a standard operation?
5 + np.nan
## nan
  • NA soak up any numerical value, so they are easy to detect!
  • BUT, most functions automatically throw them out or treat them as zeros!
df_school.loc[:, ["2015 [YR2015]"]].mean()
## 2015 [YR2015]    86.816233
## dtype: float64

Detecting missing data

  • Detection through isna
df_school.loc[:, ["2015 [YR2015]"]].isna()
##       2015 [YR2015]
## 3906          False
## 3907          False
## 3908           True
## 3909           True
## 3910           True
## ...             ...
## 4118           True
## 4119          False
## 4120           True
## 4121           True
## 4122           True
## 
## [217 rows x 1 columns]

Detecting missing data

  • Which countries have NA?
school_2015 = df_school.loc[:, ["Country Name", "2015 [YR2015]"]]
NA_countries = school_2015.loc[school_2015["2015 [YR2015]"].isna() == True,"Country Name"]
NA_countries
## 3908                  Algeria
## 3909           American Samoa
## 3910                  Andorra
## 3911                   Angola
## 3915                    Aruba
##                 ...          
## 4117                  Vietnam
## 4118    Virgin Islands (U.S.)
## 4120              Yemen, Rep.
## 4121                   Zambia
## 4122                 Zimbabwe
## Name: Country Name, Length: 75, dtype: object

[You can print them all with print(NA_countries.to_string())]

What can you do with missing data?

  • First, you have to know how much missing data there is, and where.
    • The amount of missing data could indicate that there indeed exists no data, mistakes in data collection, value has been erased by a previous user, or any type of problem.
    • Think about daily stock data, and holidays.
  • Then, you essentially have 2 solutions: drop, or input
    • drop means you’ll be dropping all missing observations from your data. This can lead you to suppress entire countries, characteristics, or time periods.
    • input is a more iteresting approach, where you replace the missing data by an educated guess.
    • This is hard, method-dependent, and likely to be wrong.

Deleting missing data

  • In case you really want to (I wouldn’t), you can delete rows and columns where there are NA.
# Deleting rows where there are at least one missing data
Indicators.dropna(axis = 0)
##                                  Series Name  ... 2015 [YR2015]
## 0                          Population, total  ...      34413603
## 1                          Population, total  ...       2880703
## 2                          Population, total  ...      39728020
## 3                          Population, total  ...         55806
## 4                          Population, total  ...         77993
## ...                                      ...  ...           ...
## 11057  Inflation, consumer prices (annual %)  ...       8.66627
## 11059  Inflation, consumer prices (annual %)  ...       2.48327
## 11061  Inflation, consumer prices (annual %)  ...      0.631201
## 11063  Inflation, consumer prices (annual %)  ...       1.43161
## 11065  Inflation, consumer prices (annual %)  ...       10.1106
## 
## [6455 rows x 20 columns]

Deleting missing data

# Deleting columns where there are at least one missing data
Indicators.dropna(axis = 1)
##                                  Series Name  ... Country Code
## 0                          Population, total  ...          AFG
## 1                          Population, total  ...          ALB
## 2                          Population, total  ...          DZA
## 3                          Population, total  ...          ASM
## 4                          Population, total  ...          AND
## ...                                      ...  ...          ...
## 11062  Inflation, consumer prices (annual %)  ...          VIR
## 11063  Inflation, consumer prices (annual %)  ...          PSE
## 11064  Inflation, consumer prices (annual %)  ...          YEM
## 11065  Inflation, consumer prices (annual %)  ...          ZMB
## 11066  Inflation, consumer prices (annual %)  ...          ZWE
## 
## [11067 rows x 4 columns]

Deleting missing data

Training to detect NAs

Practice with pandas

  1. Download and import the popular indicators from the WorldBank (found here).
  2. Transform all missing observations into NA.
  3. Select the following variables:
    • Life expectancy at birth, total (years)
    • Fertility rate, total (births per woman)
    • CO2 emissions (metric tons per capita)
    • Total debt service (% of GNI)
  4. Compute the number of missing observations for each year.
  5. Using .groupby("Country Code"), compute the number of missing observations per country per year. Which countries would you drop from the analysis?

Describing data

What’s in your data?

  • First step is always to check for missing data.

  • Second step is to know what you are observing exactly.

  • We have already seen a bit of that beforehand (number of characteristics, individuals, and time).

  • We need to investigate more.

  • READ THE METADATA \(\Longrightarrow\) it’ll tell you a description of the variables and their units.

  • Once this is done, we can start the investigation.

What’s in your data?

  • How many variables do you observe per year? What are the distributions?
Indicators.iloc[:,4:20].describe()
##         2000 [YR2000]  2001 [YR2001]  ...  2014 [YR2014]  2015 [YR2015]
## count          7460.0         7334.0  ...         8180.0         7887.0
## unique         6704.0         6863.0  ...         7434.0         7138.0
## top             100.0            0.1  ...            7.0            7.0
## freq            213.0           41.0  ...           38.0           35.0
## 
## [4 rows x 16 columns]
  • We have one problem here: Python has not recognized that the values are numbers, they are treated as objects.
Indicators["2000 [YR2000]"].dtype
## dtype('O')

What’s in your data?

  • Let’s transform the year 2000 into a numerical vector
Indicators["2000 [YR2000]"] = pd.to_numeric(Indicators["2000 [YR2000]"])
Indicators["2000 [YR2000]"].dtype
## dtype('float64')
  • Let’s do this for every year of data
column_names = Indicators.columns
for name in column_names[4:20]:
  Indicators[name] = pd.to_numeric(Indicators[name])
Indicators.dtypes
## Series Name       object
## Series Code       object
## Country Name      object
## Country Code      object
## 2000 [YR2000]    float64
## 2001 [YR2001]    float64
## 2002 [YR2002]    float64
## 2003 [YR2003]    float64
## 2004 [YR2004]    float64
## 2005 [YR2005]    float64
## 2006 [YR2006]    float64
## 2007 [YR2007]    float64
## 2008 [YR2008]    float64
## 2009 [YR2009]    float64
## 2010 [YR2010]    float64
## 2011 [YR2011]    float64
## 2012 [YR2012]    float64
## 2013 [YR2013]    float64
## 2014 [YR2014]    float64
## 2015 [YR2015]    float64
## dtype: object

What’s in your data?

  • Let’s try our describe again
Indicators.iloc[:,4:20].describe()
##        2000 [YR2000]  2001 [YR2001]  ...  2014 [YR2014]  2015 [YR2015]
## count   7.460000e+03   7.334000e+03  ...   8.180000e+03   7.887000e+03
## mean    1.513751e+10   1.627544e+10  ...   3.346946e+10   3.415759e+10
## std     2.426188e+11   2.525027e+11  ...   4.716841e+11   4.911760e+11
## min    -1.627550e+11  -3.735700e+10  ...  -1.728058e+11  -2.093630e+11
## 25%     6.656981e+00   6.264806e+00  ...   6.000000e+00   5.382360e+00
## 50%     5.837948e+01   5.951010e+01  ...   6.033187e+01   5.231245e+01
## 75%     3.702500e+03   4.540000e+03  ...   5.674048e+03   6.035000e+03
## max     1.038367e+13   1.074370e+13  ...   1.805496e+13   1.869601e+13
## 
## [8 rows x 16 columns]
  • We now have indicators of the distributions!

[Reminder: \(q_\alpha\) is such that \(\alpha\)% of observations fall below \(q_\alpha\).]

  • This is somewhat uninformative. Why?

Grouping data elements

  • We are mixing together all characteristics!
  • Let us try to perform these summary statistics across countries, for each year \(\times\) characteristic.
Indicators.groupby("Series Name").describe()
##                                                    2000 [YR2000]  ... 2015 [YR2015]
##                                                            count  ...           max
## Series Name                                                       ...              
## Adolescent fertility rate (births per 1,000 wom...         194.0  ...  1.922200e+02
## Agriculture, forestry, and fishing, value added...         184.0  ...  5.865189e+01
## Births attended by skilled health staff (% of t...         142.0  ...  1.000000e+02
## CO2 emissions (metric tons per capita)                     191.0  ...  3.247057e+01
## Contraceptive prevalence, any methods (% of wom...          78.0  ...  8.675342e+01
## Electric power consumption (kWh per capita)                140.0  ...  9.100000e+02
## Energy use (kg of oil equivalent per capita)               140.0  ...  1.747889e+04
## Exports of goods and services (% of GDP)                   176.0  ...  2.211966e+02
## External debt stocks, total (DOD, current US$)             116.0  ...  1.333777e+12
## Fertility rate, total (births per woman)                   200.0  ...  7.169000e+00
## Foreign direct investment, net (BoP, current US$)          145.0  ...  1.331628e+11
## Foreign direct investment, net inflows (BoP, cu...         189.0  ...  5.114340e+11
## Forest area (sq. km)                                       210.0  ...  8.149305e+06
## GDP (current US$)                                          197.0  ...  1.823830e+13
## GDP growth (annual %)                                      193.0  ...  2.517640e+01
## GDP per capita (current US$)                               197.0  ...  1.673133e+05
## GNI per capita, Atlas method (current US$)                 179.0  ...  1.011200e+05
## GNI per capita, PPP (current international $)              181.0  ...  1.039100e+05
## GNI, Atlas method (current US$)                            179.0  ...  1.820010e+13
## GNI, PPP (current international $)                         181.0  ...  1.869601e+13
## Gross capital formation (% of GDP)                         166.0  ...  7.940108e+01
## High-technology exports (% of manufactured expo...           0.0  ...  7.847656e+01
## Immunization, measles (% of children ages 12-23...         189.0  ...  9.900000e+01
## Imports of goods and services (% of GDP)                   176.0  ...  1.935083e+02
## Income share held by lowest 20%                             50.0  ...  1.000000e+01
## Industry (including construction), value added ...         180.0  ...  6.136236e+01
## Inflation, GDP deflator (annual %)                         192.0  ...  4.647625e+01
## Inflation, consumer prices (annual %)                      157.0  ...  1.217381e+02
## Life expectancy at birth, total (years)                    201.0  ...  8.427805e+01
## Market capitalization of listed domestic compan...          69.0  ...  1.029426e+03
## Merchandise trade (% of GDP)                               185.0  ...  3.456085e+02
## Military expenditure (% of GDP)                            145.0  ...  1.332567e+01
## Mobile cellular subscriptions (per 100 people)             204.0  ...  3.149218e+02
## Mortality rate, under-5 (per 1,000 live births)            193.0  ...  1.383000e+02
## Net ODA received per capita (current US$)                  144.0  ...  4.473376e+03
## Net barter terms of trade index (2000 = 100)               202.0  ...  2.659189e+02
## Net migration                                                0.0  ...           NaN
## Personal remittances, paid (current US$)                   144.0  ...  6.071900e+10
## Population growth (annual %)                               217.0  ...  5.790591e+00
## Population, total                                          217.0  ...  1.371220e+09
## Poverty headcount ratio at national poverty lin...          19.0  ...  6.560000e+01
## Prevalence of HIV, total (% of population ages ...         122.0  ...  2.890000e+01
## Prevalence of underweight, weight for age (% of...          56.0  ...  3.630000e+01
## Primary completion rate, total (% of relevant a...         127.0  ...  1.142725e+02
## Revenue, excluding grants (% of GDP)                        82.0  ...  1.204934e+02
## School enrollment, primary and secondary (gross...         139.0  ...  1.134040e+00
## School enrollment, secondary (% gross)                     145.0  ...  1.639347e+02
## Start-up procedures to register a business (num...           0.0  ...  2.000000e+01
## Surface area (sq. km)                                      215.0  ...  1.709825e+07
## Total debt service (% of GNI)                              112.0  ...  3.295926e+01
## Water productivity, total (constant 2010 US$ GD...           0.0  ...           NaN
## 
## [51 rows x 128 columns]

Grouping data elements

  • This is better but we cannot directly perform this stats for across countries/year.
  • If we want to use groupby, we need to change our data structure!
# Initialize our final object
#----------------------------
Final_data = Indicators.iloc[:,0:5]
# Create a variable "year"
Final_data["year"] = "2000"
# Rename the values
Final_data = Final_data.rename(columns = {"2000 [YR2000]" : "Value"})

# Loop over all remaining years
#------------------------------
for name in column_names[5:20]:
  Temp_data = Indicators.loc[:,['Series Name', 'Series Code', 
  'Country Name', 'Country Code', name]]
  Temp_data["year"] = name[0:4]
  Temp_data = Temp_data.rename(columns = {name : "Value"})
  
  # Concatenate data
  Final_data = pd.concat([Final_data, Temp_data])

Grouping data elements

  • Now we can describe our data across countries/year.
Final_data.groupby(["Series Name"]).describe()
##                                                      Value  ...              
##                                                      count  ...           max
## Series Name                                                 ...              
## Adolescent fertility rate (births per 1,000 wom...  3104.0  ...  2.171594e+02
## Agriculture, forestry, and fishing, value added...  3029.0  ...  7.904236e+01
## Births attended by skilled health staff (% of t...  1731.0  ...  1.000000e+02
## CO2 emissions (metric tons per capita)              3056.0  ...  4.769993e+01
## Contraceptive prevalence, any methods (% of wom...   679.0  ...  9.600000e+01
## Electric power consumption (kWh per capita)         2114.0  ...  5.479917e+04
## Energy use (kg of oil equivalent per capita)        2261.0  ...  2.212043e+04
## Exports of goods and services (% of GDP)            2952.0  ...  4.332235e+02
## External debt stocks, total (DOD, current US$)      1888.0  ...  1.778391e+12
## Fertility rate, total (births per woman)            3217.0  ...  7.679000e+00
## Foreign direct investment, net (BoP, current US$)   2768.0  ...  1.772770e+11
## Foreign direct investment, net inflows (BoP, cu...  3135.0  ...  7.338265e+11
## Forest area (sq. km)                                3380.0  ...  8.151356e+06
## GDP (current US$)                                   3260.0  ...  1.823830e+13
## GDP growth (annual %)                               3241.0  ...  1.231396e+02
## GDP per capita (current US$)                        3260.0  ...  1.894324e+05
## GNI per capita, Atlas method (current US$)          3031.0  ...  1.219000e+05
## GNI per capita, PPP (current international $)       3025.0  ...  1.324400e+05
## GNI, Atlas method (current US$)                     3031.0  ...  1.820010e+13
## GNI, PPP (current international $)                  3025.0  ...  1.869601e+13
## Gross capital formation (% of GDP)                  2754.0  ...  7.940108e+01
## High-technology exports (% of manufactured expo...  1174.0  ...  7.847656e+01
## Immunization, measles (% of children ages 12-23...  3053.0  ...  9.900000e+01
## Imports of goods and services (% of GDP)            2952.0  ...  4.275765e+02
## Income share held by lowest 20%                     1125.0  ...  1.090000e+01
## Industry (including construction), value added ...  3020.0  ...  8.779689e+01
## Inflation, GDP deflator (annual %)                  3240.0  ...  2.630123e+03
## Inflation, consumer prices (annual %)               2803.0  ...  5.139068e+02
## Life expectancy at birth, total (years)             3207.0  ...  8.541707e+01
## Market capitalization of listed domestic compan...  1141.0  ...  1.254465e+03
## Merchandise trade (% of GDP)                        3060.0  ...  1.170100e+04
## Military expenditure (% of GDP)                     2415.0  ...  3.265567e+01
## Mobile cellular subscriptions (per 100 people)      3275.0  ...  3.149218e+02
## Mortality rate, under-5 (per 1,000 live births)     3088.0  ...  2.277000e+02
## Net ODA received per capita (current US$)           2287.0  ...  4.473376e+03
## Net barter terms of trade index (2000 = 100)        3221.0  ...  4.585745e+02
## Net migration                                        582.0  ...  5.428756e+06
## Personal remittances, paid (current US$)            2704.0  ...  6.071900e+10
## Population growth (annual %)                        3468.0  ...  1.751221e+01
## Population, total                                   3468.0  ...  1.371220e+09
## Poverty headcount ratio at national poverty lin...   717.0  ...  8.330000e+01
## Prevalence of HIV, total (% of population ages ...  1952.0  ...  2.890000e+01
## Prevalence of underweight, weight for age (% of...   529.0  ...  4.950000e+01
## Primary completion rate, total (% of relevant a...  2075.0  ...  1.345425e+02
## Revenue, excluding grants (% of GDP)                1790.0  ...  3.415165e+02
## School enrollment, primary and secondary (gross...  2215.0  ...  1.163180e+00
## School enrollment, secondary (% gross)              2320.0  ...  1.639347e+02
## Start-up procedures to register a business (num...  2200.0  ...  2.100000e+01
## Surface area (sq. km)                               3450.0  ...  1.709825e+07
## Total debt service (% of GNI)                       1852.0  ...  5.967140e+01
## Water productivity, total (constant 2010 US$ GD...   503.0  ...  1.220277e+03
## 
## [51 rows x 8 columns]

Grouping data elements

  • Now we can perform our operations across countries for each year.
Final_data.groupby(["Series Name", "year"]).describe()
##                                                          Value  ...             
##                                                          count  ...          max
## Series Name                                        year         ...             
## Adolescent fertility rate (births per 1,000 wom... 2000  194.0  ...   217.159400
##                                                    2001  194.0  ...   216.354200
##                                                    2002  194.0  ...   215.549000
##                                                    2003  194.0  ...   214.362800
##                                                    2004  194.0  ...   213.176600
## ...                                                        ...  ...          ...
## Water productivity, total (constant 2010 US$ GD... 2011    0.0  ...          NaN
##                                                    2012  174.0  ...  1220.277310
##                                                    2013    0.0  ...          NaN
##                                                    2014    1.0  ...    59.838179
##                                                    2015    0.0  ...          NaN
## 
## [816 rows x 8 columns]

Grouping data elements

  • Or across years for each country/series pair.
Final_data.groupby(["Series Name", "Country Name"]).describe()
##                                                                          Value  ...            
##                                                                          count  ...         max
## Series Name                                        Country Name                 ...            
## Adolescent fertility rate (births per 1,000 wom... Afghanistan            16.0  ...  153.951000
##                                                    Albania                16.0  ...   20.730000
##                                                    Algeria                16.0  ...   11.755600
##                                                    American Samoa          0.0  ...         NaN
##                                                    Andorra                 0.0  ...         NaN
## ...                                                                        ...  ...         ...
## Water productivity, total (constant 2010 US$ GD... Virgin Islands (U.S.)   0.0  ...         NaN
##                                                    West Bank and Gaza      2.0  ...   20.964942
##                                                    Yemen, Rep.             3.0  ...    7.748258
##                                                    Zambia                  3.0  ...   14.642894
##                                                    Zimbabwe                3.0  ...    4.635716
## 
## [11067 rows x 8 columns]

Grouping data elements

  • You can also perform simple operations
Final_data.groupby(["Series Name"]).mean()
##                                                            Value
## Series Name                                                     
## Adolescent fertility rate (births per 1,000 wom...  5.596632e+01
## Agriculture, forestry, and fishing, value added...  1.198201e+01
## Births attended by skilled health staff (% of t...  8.923212e+01
## CO2 emissions (metric tons per capita)              4.392759e+00
## Contraceptive prevalence, any methods (% of wom...  4.953462e+01
## Electric power consumption (kWh per capita)         4.038277e+03
## Energy use (kg of oil equivalent per capita)        2.509368e+03
## Exports of goods and services (% of GDP)            4.311117e+01
## External debt stocks, total (DOD, current US$)      3.203978e+10
## Fertility rate, total (births per woman)            2.970055e+00
## Foreign direct investment, net (BoP, current US$)  -5.336153e+08
## Foreign direct investment, net inflows (BoP, cu...  9.238150e+09
## Forest area (sq. km)                                1.936816e+05
## GDP (current US$)                                   2.771587e+11
## GDP growth (annual %)                               3.812903e+00
## GDP per capita (current US$)                        1.409407e+04
## GNI per capita, Atlas method (current US$)          1.153259e+04
## GNI per capita, PPP (current international $)       1.557044e+04
## GNI, Atlas method (current US$)                     2.931429e+11
## GNI, PPP (current international $)                  4.097333e+11
## Gross capital formation (% of GDP)                  2.383623e+01
## High-technology exports (% of manufactured expo...  1.057931e+01
## Immunization, measles (% of children ages 12-23...  8.573567e+01
## Imports of goods and services (% of GDP)            4.936052e+01
## Income share held by lowest 20%                     6.641867e+00
## Industry (including construction), value added ...  2.651691e+01
## Inflation, GDP deflator (annual %)                  7.588137e+00
## Inflation, consumer prices (annual %)               6.243083e+00
## Life expectancy at birth, total (years)             6.950555e+01
## Market capitalization of listed domestic compan...  6.644808e+01
## Merchandise trade (% of GDP)                        7.313543e+01
## Military expenditure (% of GDP)                     2.002960e+00
## Mobile cellular subscriptions (per 100 people)      6.585590e+01
## Mortality rate, under-5 (per 1,000 live births)     4.224417e+01
## Net ODA received per capita (current US$)           1.281651e+02
## Net barter terms of trade index (2000 = 100)        1.126721e+02
## Net migration                                      -1.006624e+03
## Personal remittances, paid (current US$)            1.563514e+09
## Population growth (annual %)                        1.425238e+00
## Population, total                                   3.089199e+07
## Poverty headcount ratio at national poverty lin...  2.500781e+01
## Prevalence of HIV, total (% of population ages ...  2.245031e+00
## Prevalence of underweight, weight for age (% of...  1.410076e+01
## Primary completion rate, total (% of relevant a...  8.804940e+01
## Revenue, excluding grants (% of GDP)                2.723547e+01
## School enrollment, primary and secondary (gross...  9.726875e-01
## School enrollment, secondary (% gross)              7.873407e+01
## Start-up procedures to register a business (num...  8.725909e+00
## Surface area (sq. km)                               6.143649e+05
## Total debt service (% of GNI)                       4.147872e+00
## Water productivity, total (constant 2010 US$ GD...  6.453666e+01

Training to perform basic data description

Practice with pandas

  1. Start back with the worldbank data where NA has been treated.
  2. Compute the average and standard deviations of the four series across all countries for each year. How have these indicators moved through time? Comment on the heterogeneity across countries (do no forget the missing values!)
  3. Compute the same indicators across countries but across years. What is the country with the lowest fertility rate? The one with the highest? What is the country with the most variation in time?
  4. Compute the distribution quartiles for each year and variable across countries. For each year/variable pair, identify the country at 25% and 75%. Is there a lot of variation?

Identifying outliers

What is an outlier?

  • When you’re dealing with data, an outlier is you worst nightmare.
  • It’s a value that’s completely implausible, because of a mistake or a particular coding of variables that was not clear.
    • Sometimes, missing data is indicated by 99 instead of NA.
    • It can be a code for refused to answer, which is different from a lack of answer.
    • It can also be a missing or extra (couple) of zeros.
  • How to detect them?
    1. Standard statistics and plausibility assessment.
    2. Plots.

Standard statistics and graphical representation

  • Let us focus on 4 different variables for instance.
    • GDP growth,
    • Inflation (consumer prices),
    • Start up procedures to register a business (ratio of population),
    • Income share held by lowest 20%.
# Selecting the variables
Subset_data = Final_data[Final_data["Series Name"].isin(["GDP growth (annual %)",
                          "Inflation, consumer prices (annual %)",
                          "Start-up procedures to register a business (number)",
                          "Population, total",
                          "Income share held by lowest 20%"])]
Subset_data.shape
## (17360, 6)

Standard statistics and graphical representation

  • Let us compute the share of startup as fraction of population
new_variable = Subset_data[Subset_data["Series Name"]=="Start-up procedures to register a business (number)"]
Population = Subset_data.loc[Subset_data["Series Name"]=="Population, total", "Value"]

# Python can only compute operations on DF that have the same indexes to avoid mistakes
# This doesn't really work...
Population.index = new_variable.index
new_variable.loc[:,"Value"] = new_variable.loc[:,"Value"]/Population
## C:\Users\gr114\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\pandas\core\indexing.py:1743: SettingWithCopyWarning: 
## A value is trying to be set on a copy of a slice from a DataFrame.
## Try using .loc[row_indexer,col_indexer] = value instead
## 
## See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
##   isetter(ilocs[0], value)

Playing with indexing

  • Technically, there is a much better solution: playing with indexing.
Subset_data.index
## Int64Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,
##                 9,
##             ...
##             11057, 11058, 11059, 11060, 11061, 11062, 11063, 11064, 11065,
##             11066],
##            dtype='int64', length=17360)
# The index will be Country x Year x Series
My_index = pd.MultiIndex.from_frame(Subset_data.loc[:, ["Country Code", 
                                                        "year",
                                                        "Series Code"]])
                                                        
# Correcting the data frame
Subset_data.index = My_index
Subset_data = Subset_data.loc[:, ["Series Name", "Value"]]
Subset_data.index
## MultiIndex([('AFG', '2000',    'SP.POP.TOTL'),
##             ('ALB', '2000',    'SP.POP.TOTL'),
##             ('DZA', '2000',    'SP.POP.TOTL'),
##             ('ASM', '2000',    'SP.POP.TOTL'),
##             ('AND', '2000',    'SP.POP.TOTL'),
##             ('AGO', '2000',    'SP.POP.TOTL'),
##             ('ATG', '2000',    'SP.POP.TOTL'),
##             ('ARG', '2000',    'SP.POP.TOTL'),
##             ('ARM', '2000',    'SP.POP.TOTL'),
##             ('ABW', '2000',    'SP.POP.TOTL'),
##             ...
##             ('URY', '2015', 'FP.CPI.TOTL.ZG'),
##             ('UZB', '2015', 'FP.CPI.TOTL.ZG'),
##             ('VUT', '2015', 'FP.CPI.TOTL.ZG'),
##             ('VEN', '2015', 'FP.CPI.TOTL.ZG'),
##             ('VNM', '2015', 'FP.CPI.TOTL.ZG'),
##             ('VIR', '2015', 'FP.CPI.TOTL.ZG'),
##             ('PSE', '2015', 'FP.CPI.TOTL.ZG'),
##             ('YEM', '2015', 'FP.CPI.TOTL.ZG'),
##             ('ZMB', '2015', 'FP.CPI.TOTL.ZG'),
##             ('ZWE', '2015', 'FP.CPI.TOTL.ZG')],
##            names=['Country Code', 'year', 'Series Code'], length=17360)

Playing with indexing

  • We can now construct our variable more easily
new_variable_num = Subset_data.xs("IC.REG.PROC", level = "Series Code")
new_variable_denom = Subset_data.xs("SP.POP.TOTL", level = "Series Code")

new_variable = new_variable_num["Value"]/new_variable_denom["Value"]
new_variable
## Country Code  year
## AFG           2000             NaN
## ALB           2000             NaN
## DZA           2000             NaN
## ASM           2000             NaN
## AND           2000             NaN
##                           ...     
## VIR           2015             NaN
## PSE           2015    2.341870e-06
## YEM           2015    2.264332e-07
## ZMB           2015    4.408235e-07
## ZWE           2015    7.238696e-07
## Name: Value, Length: 3472, dtype: float64
  • Key advantage: We had the risk beforehand that countries and years would not be matched. Not anymore.

Joining our data frames

  • Now we should join our two pieces of data.
old_index = new_variable.index.to_frame()
old_index.insert(2, "Series Code", "IC.REG.PROC.PCT")
new_variable.index = pd.MultiIndex.from_frame(old_index)

Subset_data = pd.concat([Subset_data, new_variable.to_frame()])
Subset_data
##                                          Series Name         Value
## Country Code year Series Code                                     
## AFG          2000 SP.POP.TOTL      Population, total  2.077996e+07
## ALB          2000 SP.POP.TOTL      Population, total  3.089027e+06
## DZA          2000 SP.POP.TOTL      Population, total  3.104224e+07
## ASM          2000 SP.POP.TOTL      Population, total  5.781600e+04
## AND          2000 SP.POP.TOTL      Population, total  6.539000e+04
## ...                                              ...           ...
## VIR          2015 IC.REG.PROC.PCT                NaN           NaN
## PSE          2015 IC.REG.PROC.PCT                NaN  2.341870e-06
## YEM          2015 IC.REG.PROC.PCT                NaN  2.264332e-07
## ZMB          2015 IC.REG.PROC.PCT                NaN  4.408235e-07
## ZWE          2015 IC.REG.PROC.PCT                NaN  7.238696e-07
## 
## [20832 rows x 2 columns]

Back to statistics!

Back to statistics!

  • Let’s plot the distribution of our variables!
  • We need the matplotlib library
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 7})
plt.close("all")

plt.figure();
My_hist = Subset_data.hist(column = "Value", by = Subset_data["Series Name"], bins = 25)

Back to statistics

plt.show(My_hist)

  • Something weird?

Back to statistics

  • Let’s confirm with a boxplot.
# Unstacking the series level to control for the boxplot
data_to_plot = Subset_data.unstack(level = 2).loc[:,"Value"]
data_to_plot["Series"] = 1
data_to_plot.columns = ['CPI (% growth)', 'New Start-Up', 'New Start-Up (%pop)', 'GDP growth', 'Income share', 'Population', 'Series']

plt.figure();
fig, axs = plt.subplots(1,6, sharey=False)
data_to_plot.boxplot(by = "Series", ax = axs)
## array([<matplotlib.axes._subplots.AxesSubplot object at 0x0000000033AD8588>,
##        <matplotlib.axes._subplots.AxesSubplot object at 0x000000002ED00A58>,
##        <matplotlib.axes._subplots.AxesSubplot object at 0x00000000335A6DD8>,
##        <matplotlib.axes._subplots.AxesSubplot object at 0x00000000335DB198>,
##        <matplotlib.axes._subplots.AxesSubplot object at 0x000000003360B518>,
##        <matplotlib.axes._subplots.AxesSubplot object at 0x000000003363E898>],
##       dtype=object)
## 
## C:\Users\gr114\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\pandas\plotting\_matplotlib\boxplot.py:391: UserWarning: When passing multiple axes, sharex and sharey are ignored. These settings must be specified when creating axes
##   **kwds,
plt.subplots_adjust(wspace = .7, hspace = .4)

Back to statistics

plt.show(fig)

  • Something weird?

Back to statistics

  • Doing the plot year by year
#Unstacking the series level to control for the boxplot
data_to_plot2 = data_to_plot.reset_index(level = 1) # filling year into a series
data_to_plot2 = data_to_plot2.loc[:,data_to_plot2.columns != "Series"] # Dropping Series

plt.figure();
fig, axs = plt.subplots(3,2, figsize = (9,5), sharey=False)
data_to_plot2.boxplot(by = "year", ax = axs)
## array([<matplotlib.axes._subplots.AxesSubplot object at 0x0000000033797B38>,
##        <matplotlib.axes._subplots.AxesSubplot object at 0x000000003371ADD8>,
##        <matplotlib.axes._subplots.AxesSubplot object at 0x00000000337A47F0>,
##        <matplotlib.axes._subplots.AxesSubplot object at 0x0000000030928C18>,
##        <matplotlib.axes._subplots.AxesSubplot object at 0x000000003095AF98>,
##        <matplotlib.axes._subplots.AxesSubplot object at 0x0000000030995358>],
##       dtype=object)
## 
## C:\Users\gr114\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\pandas\plotting\_matplotlib\boxplot.py:391: UserWarning: When passing multiple axes, sharex and sharey are ignored. These settings must be specified when creating axes
##   **kwds,
## C:\Users\gr114\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\numpy\core\_asarray.py:83: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
##   return array(a, dtype, copy=False, order=order)
## C:\Users\gr114\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\numpy\core\_asarray.py:83: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
##   return array(a, dtype, copy=False, order=order)
## C:\Users\gr114\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\numpy\core\_asarray.py:83: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
##   return array(a, dtype, copy=False, order=order)
## C:\Users\gr114\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\numpy\core\_asarray.py:83: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
##   return array(a, dtype, copy=False, order=order)
## C:\Users\gr114\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\numpy\core\_asarray.py:83: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
##   return array(a, dtype, copy=False, order=order)
## C:\Users\gr114\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\numpy\core\_asarray.py:83: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
##   return array(a, dtype, copy=False, order=order)
for ax in fig.axes:
    plt.sca(ax)
    plt.xticks(rotation=45)
## (array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16]), <a list of 16 Text major ticklabel objects>)
## (array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16]), <a list of 16 Text major ticklabel objects>)
## (array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16]), <a list of 16 Text major ticklabel objects>)
## (array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16]), <a list of 16 Text major ticklabel objects>)
## (array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16]), <a list of 16 Text major ticklabel objects>)
## (array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16]), <a list of 16 Text major ticklabel objects>)
plt.subplots_adjust(wspace = .1, hspace = 1)

Back to statistics

Time Series plots

  • Matplotlib is nice but becomes rapidly limited when it comes to various sources of plots.
  • In any case, it’s a bit hard to deal with and plots are quite ugly.
  • Alternative: plotnine!
    • This is the adaptation of ggplot package in R which creates beautiful graphs easily.
    • More help can be found here or here.
  • Installing:
# Installing plotnine using conda
conda install -c conda-forge plotnine

Time Series plots with plotnine

import plotnine as plt9

data_time_series = Subset_data.reset_index()
data_time_series = data_time_series.loc[
  data_time_series["Country Code"].isin(["CAN","USA", "FRA", "CHN"]),:]
# We need to change the type of year from "object" to "numeric" to get lines of time series
data_time_series["year"] = pd.to_numeric(data_time_series["year"])

# Change series name for beautiful printing
data_time_series["Series"] = data_time_series["Series Code"]
data_time_series["Series"].loc[data_time_series["Series"]=="SP.POP.TOTL"] = "Population"
data_time_series["Series"].loc[data_time_series["Series"]=="SI.DST.FRST.20"] = "Income share 20%"
data_time_series["Series"].loc[data_time_series["Series"]=="NY.GDP.MKTP.KD.ZG"] = "GDP groth"
data_time_series["Series"].loc[data_time_series["Series"]=="IC.REG.PROC"] = "startup"
data_time_series["Series"].loc[data_time_series["Series"]=="IC.REG.PROC.PCT"] = "startup (% pop)"
data_time_series["Series"].loc[data_time_series["Series"]=="FP.CPI.TOTL.ZG"] = "Inflation (%)"
  
my_plot9 = (
  plt9.ggplot(data_time_series) 
    + plt9.aes(x = "year", y = "Value")
    + plt9.facet_grid(facets = ('Series','Country Code'), scales = "free")
    + plt9.geom_line() 
    + plt9.theme(axis_text_x=plt9.element_text(angle=45), 
                 strip_text_y=plt9.element_text(size=5), 
                 axis_text_y=plt9.element_text(size=7))
    + plt9.labs(title = "Time Series plot", y = "", x = "")
)

Time Series plots with plotnine

## <ggplot: (97502689)>

Time Series plots with plotnine

  • You can add options very easily: dark theme
## <ggplot: (101080254)>

Time Series plots with plotnine

  • You can add options very easily: adding points
## <ggplot: (101395909)>

Time Series plots with plotnine

  • Compress one dimension but different colors
(
  plt9.ggplot(data_time_series) 
    + plt9.aes(x = "year", y = "Value", colour = "Country Code")
    + plt9.facet_grid(facets = ('Series','.'), scales = "free")
    + plt9.geom_line() 
    + plt9.theme(axis_text_x=plt9.element_text(angle=45), 
                 strip_text_y=plt9.element_text(size=5), 
                 axis_text_y=plt9.element_text(size=7), 
                 subplots_adjust={'right': 0.8})
    + plt9.labs(title = "Time Series plot", y = "", x = "")
)

Time Series plots with plotnine

  • Compress one dimension but different colors
## <ggplot: (104554998)>

Time Series plots with plotnine

  • Using bars with geom_col
## <ggplot: (101557561)>

Time Series plots with plotnine

  • Changing colors
(
  plt9.ggplot(data_time_series) 
    + plt9.aes(x = "year", y = "Value", colour = "Country Code")
    + plt9.facet_grid(facets = ('Series','.'), scales = "free")
    + plt9.geom_line() 
    + plt9.theme(axis_text_x=plt9.element_text(angle=45), 
                 strip_text_y=plt9.element_text(size=5), 
                 axis_text_y=plt9.element_text(size=7), 
                 subplots_adjust={'right': 0.8})
    + plt9.labs(title = "Time Series plot", y = "", x = "")
    + plt9.scale_color_brewer(type="seq")
)

Time Series plots with plotnine

  • Changing colors
## <ggplot: (-9223372036757269125)>

Plotting advice

Plotting advice

Data viz

  • A good data visualization is a very powerful tool!
  • Visit this website to get an idea of what powerful visualization is!
  • Spend time doing graphs and play with the options!
  • We’ll explore other data possibilities next lecture!

Exercises

Training to perform basic data visualization

Practice with pandas

  1. Start back with the worldbank data where NA has been treated.
  2. Using matplotlib or plotnine, present a boxplot of the four variables across countries and years (plotnine has the function geom_boxplot).
  3. Using matplotlib or plotnine, present a histogram of the four variables across countries and years (plotnine has the function geom_histogram).
  4. Do the same boxplots where the x-axis of each boxplot corresponds to years.
  5. Using plotnine, do the same histograms with 4 panels, where each panel presents the histograms of 2000, 2007 and 2015.
  6. Do you identify outliers from the plots? If so, identify in the data which countries contain these outliers.
  7. Plot the time series of the fertility rate on one plot, where each color is a country. What is the graphical problem here?
  8. Instead, do a barchart of the fertility rate in 2015, where the x-axis represents the country. What is the problem now?